Skip to content

Conversation

@fairydreaming
Copy link
Collaborator

Fixes #6877

Contains the following changes:

  • increases maximum number of experts from 60 to 128
  • adds new tensor type FFN_NORM_EXP (for a normalization block before MoE that runs in parallel to the attention + FFN, see Add support to ArcticForCausalLM #6877 for details)
  • introduces architecture-specific block mappings in gguf-py (details in Add support to ArcticForCausalLM #6877)
  • adds new model type MODEL_10B_128x3_66B
  • adds new ARCTIC architecture and a general support for models based on this architecture

Model files for testing: https://huggingface.co/sszymczyk/snowflake-arctic-instruct-GGUF

@github-actions
Copy link
Contributor

github-actions bot commented May 1, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 555 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8432.03ms p(95)=21124.36ms fails=, finish reason: stop=502 truncated=53
  • Prompt processing (pp): avg=94.97tk/s p(95)=428.46tk/s
  • Token generation (tg): avg=47.12tk/s p(95)=47.24tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=snowflake-arctic-clean commit=602c80d918e609f8bd5120fcd346242ed2da5f74

prompt_tokens_seconds

More
---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 555 iterations"
    y-axis "llamacpp:prompt_tokens_seconds"
    x-axis "llamacpp:prompt_tokens_seconds" 1716550652 --> 1716551278
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 334.75, 334.75, 334.75, 334.75, 334.75, 863.81, 863.81, 863.81, 863.81, 863.81, 868.82, 868.82, 868.82, 868.82, 868.82, 871.14, 871.14, 871.14, 871.14, 871.14, 904.55, 904.55, 904.55, 904.55, 904.55, 909.66, 909.66, 909.66, 909.66, 909.66, 905.19, 905.19, 905.19, 905.19, 905.19, 914.01, 914.01, 914.01, 914.01, 914.01, 900.79, 900.79, 900.79, 900.79, 900.79, 901.79, 901.79, 901.79, 901.79, 901.79, 922.29, 922.29, 922.29, 922.29, 922.29, 960.88, 960.88, 960.88, 960.88, 960.88, 961.82, 961.82, 961.82, 961.82, 961.82, 894.05, 894.05, 894.05, 894.05, 894.05, 873.11, 873.11, 873.11, 873.11, 873.11, 876.76, 876.76, 876.76, 876.76, 876.76, 878.0, 878.0, 878.0, 878.0, 878.0, 883.5, 883.5, 883.5, 883.5, 883.5, 891.8, 891.8, 891.8, 891.8, 891.8, 892.81, 892.81, 892.81, 892.81, 892.81, 897.42, 897.42, 897.42, 897.42, 897.42, 898.64, 898.64, 898.64, 898.64, 898.64, 899.0, 899.0, 899.0, 899.0, 899.0, 908.42, 908.42, 908.42, 908.42, 908.42, 908.69, 908.69, 908.69, 908.69, 908.69, 907.17, 907.17, 907.17, 907.17, 907.17, 897.12, 897.12, 897.12, 897.12, 897.12, 893.81, 893.81, 893.81, 893.81, 893.81, 892.62, 892.62, 892.62, 892.62, 892.62, 897.8, 897.8, 897.8, 897.8, 897.8, 895.68, 895.68, 895.68, 895.68, 895.68, 893.83, 893.83, 893.83, 893.83, 893.83, 896.62, 896.62, 896.62, 896.62, 896.62, 903.93, 903.93, 903.93, 903.93, 903.93, 907.39, 907.39, 907.39, 907.39, 907.39, 906.57, 906.57, 906.57, 906.57, 906.57, 899.74, 899.74, 899.74, 899.74, 899.74, 896.94, 896.94, 896.94, 896.94, 896.94, 895.63, 895.63, 895.63, 895.63, 895.63, 896.21, 896.21, 896.21, 896.21, 896.21, 897.06, 897.06, 897.06, 897.06, 897.06, 868.95, 868.95, 868.95, 868.95, 868.95, 859.92, 859.92, 859.92, 859.92, 859.92, 857.2, 857.2, 857.2, 857.2, 857.2, 856.32, 856.32, 856.32, 856.32, 856.32, 859.85, 859.85, 859.85, 859.85, 859.85, 861.05, 861.05, 861.05, 861.05, 861.05, 859.59, 859.59, 859.59, 859.59, 859.59, 862.21, 862.21, 862.21, 862.21, 862.21, 861.3, 861.3, 861.3, 861.3, 861.3, 863.12, 863.12, 863.12, 863.12, 863.12, 860.35, 860.35, 860.35, 860.35, 860.35, 861.79, 861.79, 861.79, 861.79, 861.79, 865.67, 865.67, 865.67, 865.67, 865.67, 866.09, 866.09, 866.09, 866.09, 866.09, 865.61, 865.61, 865.61, 865.61, 865.61, 866.4, 866.4, 866.4, 866.4, 866.4, 867.39, 867.39, 867.39, 867.39, 867.39, 868.18, 868.18, 868.18, 868.18, 868.18, 868.62, 868.62, 868.62, 868.62, 868.62, 868.05, 868.05, 868.05, 868.05]
                    
Loading
predicted_tokens_seconds
More
---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 555 iterations"
    y-axis "llamacpp:predicted_tokens_seconds"
    x-axis "llamacpp:predicted_tokens_seconds" 1716550652 --> 1716551278
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 43.25, 43.25, 43.25, 43.25, 43.25, 26.51, 26.51, 26.51, 26.51, 26.51, 28.18, 28.18, 28.18, 28.18, 28.18, 31.79, 31.79, 31.79, 31.79, 31.79, 32.94, 32.94, 32.94, 32.94, 32.94, 33.5, 33.5, 33.5, 33.5, 33.5, 34.81, 34.81, 34.81, 34.81, 34.81, 34.94, 34.94, 34.94, 34.94, 34.94, 34.99, 34.99, 34.99, 34.99, 34.99, 34.96, 34.96, 34.96, 34.96, 34.96, 34.87, 34.87, 34.87, 34.87, 34.87, 34.85, 34.85, 34.85, 34.85, 34.85, 33.54, 33.54, 33.54, 33.54, 33.54, 33.52, 33.52, 33.52, 33.52, 33.52, 32.27, 32.27, 32.27, 32.27, 32.27, 30.91, 30.91, 30.91, 30.91, 30.91, 30.85, 30.85, 30.85, 30.85, 30.85, 31.21, 31.21, 31.21, 31.21, 31.21, 31.15, 31.15, 31.15, 31.15, 31.15, 30.93, 30.93, 30.93, 30.93, 30.93, 30.88, 30.88, 30.88, 30.88, 30.88, 30.88, 30.88, 30.88, 30.88, 30.88, 31.08, 31.08, 31.08, 31.08, 31.08, 30.95, 30.95, 30.95, 30.95, 30.95, 31.07, 31.07, 31.07, 31.07, 31.07, 31.15, 31.15, 31.15, 31.15, 31.15, 31.1, 31.1, 31.1, 31.1, 31.1, 31.06, 31.06, 31.06, 31.06, 31.06, 31.38, 31.38, 31.38, 31.38, 31.38, 31.58, 31.58, 31.58, 31.58, 31.58, 31.66, 31.66, 31.66, 31.66, 31.66, 31.69, 31.69, 31.69, 31.69, 31.69, 31.84, 31.84, 31.84, 31.84, 31.84, 31.85, 31.85, 31.85, 31.85, 31.85, 31.64, 31.64, 31.64, 31.64, 31.64, 31.41, 31.41, 31.41, 31.41, 31.41, 30.95, 30.95, 30.95, 30.95, 30.95, 30.93, 30.93, 30.93, 30.93, 30.93, 30.97, 30.97, 30.97, 30.97, 30.97, 31.06, 31.06, 31.06, 31.06, 31.06, 31.16, 31.16, 31.16, 31.16, 31.16, 31.32, 31.32, 31.32, 31.32, 31.32, 31.07, 31.07, 31.07, 31.07, 31.07, 30.61, 30.61, 30.61, 30.61, 30.61, 30.11, 30.11, 30.11, 30.11, 30.11, 29.98, 29.98, 29.98, 29.98, 29.98, 29.85, 29.85, 29.85, 29.85, 29.85, 29.85, 29.85, 29.85, 29.85, 29.85, 29.9, 29.9, 29.9, 29.9, 29.9, 29.91, 29.91, 29.91, 29.91, 29.91, 29.99, 29.99, 29.99, 29.99, 29.99, 29.96, 29.96, 29.96, 29.96, 29.96, 29.9, 29.9, 29.9, 29.9, 29.9, 29.83, 29.83, 29.83, 29.83, 29.83, 29.9, 29.9, 29.9, 29.9, 29.9, 30.0, 30.0, 30.0, 30.0, 30.0, 30.09, 30.09, 30.09, 30.09, 30.09, 30.16, 30.16, 30.16, 30.16, 30.16, 30.28, 30.28, 30.28, 30.28, 30.28, 30.26, 30.26, 30.26, 30.26, 30.26, 30.27, 30.27, 30.27, 30.27]
                    
Loading

Details

kv_cache_usage_ratio

More
---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 555 iterations"
    y-axis "llamacpp:kv_cache_usage_ratio"
    x-axis "llamacpp:kv_cache_usage_ratio" 1716550652 --> 1716551278
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.13, 0.13, 0.13, 0.13, 0.13, 0.34, 0.34, 0.34, 0.34, 0.34, 0.15, 0.15, 0.15, 0.15, 0.15, 0.12, 0.12, 0.12, 0.12, 0.12, 0.17, 0.17, 0.17, 0.17, 0.17, 0.15, 0.15, 0.15, 0.15, 0.15, 0.12, 0.12, 0.12, 0.12, 0.12, 0.15, 0.15, 0.15, 0.15, 0.15, 0.18, 0.18, 0.18, 0.18, 0.18, 0.16, 0.16, 0.16, 0.16, 0.16, 0.2, 0.2, 0.2, 0.2, 0.2, 0.25, 0.25, 0.25, 0.25, 0.25, 0.27, 0.27, 0.27, 0.27, 0.27, 0.41, 0.41, 0.41, 0.41, 0.41, 0.27, 0.27, 0.27, 0.27, 0.27, 0.26, 0.26, 0.26, 0.26, 0.26, 0.16, 0.16, 0.16, 0.16, 0.16, 0.21, 0.21, 0.21, 0.21, 0.21, 0.19, 0.19, 0.19, 0.19, 0.19, 0.24, 0.24, 0.24, 0.24, 0.24, 0.23, 0.23, 0.23, 0.23, 0.23, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.12, 0.12, 0.12, 0.12, 0.12, 0.14, 0.14, 0.14, 0.14, 0.14, 0.14, 0.14, 0.14, 0.14, 0.14, 0.26, 0.26, 0.26, 0.26, 0.26, 0.12, 0.12, 0.12, 0.12, 0.12, 0.11, 0.11, 0.11, 0.11, 0.11, 0.17, 0.17, 0.17, 0.17, 0.17, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.16, 0.16, 0.16, 0.16, 0.16, 0.27, 0.27, 0.27, 0.27, 0.27, 0.27, 0.27, 0.27, 0.27, 0.27, 0.25, 0.25, 0.25, 0.25, 0.25, 0.34, 0.34, 0.34, 0.34, 0.34, 0.2, 0.2, 0.2, 0.2, 0.2, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.15, 0.15, 0.15, 0.15, 0.15, 0.38, 0.38, 0.38, 0.38, 0.38, 0.64, 0.64, 0.64, 0.64, 0.64, 0.44, 0.44, 0.44, 0.44, 0.44, 0.33, 0.33, 0.33, 0.33, 0.33, 0.2, 0.2, 0.2, 0.2, 0.2, 0.28, 0.28, 0.28, 0.28, 0.28, 0.2, 0.2, 0.2, 0.2, 0.2, 0.22, 0.22, 0.22, 0.22, 0.22, 0.12, 0.12, 0.12, 0.12, 0.12, 0.21, 0.21, 0.21, 0.21, 0.21, 0.29, 0.29, 0.29, 0.29, 0.29, 0.13, 0.13, 0.13, 0.13, 0.13, 0.15, 0.15, 0.15, 0.15, 0.15, 0.12, 0.12, 0.12, 0.12, 0.12, 0.2, 0.2, 0.2, 0.2, 0.2, 0.16, 0.16, 0.16, 0.16, 0.16, 0.15, 0.15, 0.15, 0.15, 0.15, 0.15, 0.15, 0.15, 0.15, 0.15, 0.23, 0.23, 0.23, 0.23, 0.23, 0.21, 0.21, 0.21, 0.21]
                    
Loading
requests_processing
More
---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 555 iterations"
    y-axis "llamacpp:requests_processing"
    x-axis "llamacpp:requests_processing" 1716550652 --> 1716551278
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 2.0, 2.0, 2.0, 2.0, 4.0, 4.0, 4.0, 4.0, 4.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 6.0, 6.0, 6.0, 6.0, 6.0, 1.0, 1.0, 1.0, 1.0, 1.0, 4.0, 4.0, 4.0, 4.0, 4.0, 8.0, 8.0, 8.0, 8.0, 8.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 6.0, 6.0, 6.0, 6.0, 6.0, 8.0, 8.0, 8.0, 8.0, 8.0, 3.0, 3.0, 3.0, 3.0, 3.0, 8.0, 8.0, 8.0, 8.0, 8.0, 6.0, 6.0, 6.0, 6.0, 6.0, 4.0, 4.0, 4.0, 4.0, 4.0, 5.0, 5.0, 5.0, 5.0, 5.0, 4.0, 4.0, 4.0, 4.0, 4.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 5.0, 5.0, 5.0, 5.0, 5.0, 6.0, 6.0, 6.0, 6.0, 6.0, 5.0, 5.0, 5.0, 5.0, 5.0, 3.0, 3.0, 3.0, 3.0, 3.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 8.0, 8.0, 8.0, 8.0, 8.0, 3.0, 3.0, 3.0, 3.0, 3.0, 5.0, 5.0, 5.0, 5.0, 5.0, 6.0, 6.0, 6.0, 6.0, 6.0, 8.0, 8.0, 8.0, 8.0, 8.0, 6.0, 6.0, 6.0, 6.0, 6.0, 5.0, 5.0, 5.0, 5.0, 5.0, 6.0, 6.0, 6.0, 6.0, 6.0, 7.0, 7.0, 7.0, 7.0, 7.0, 2.0, 2.0, 2.0, 2.0, 2.0, 4.0, 4.0, 4.0, 4.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 2.0, 2.0, 2.0, 2.0, 4.0, 4.0, 4.0, 4.0, 4.0, 6.0, 6.0, 6.0, 6.0, 6.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 5.0, 5.0, 5.0, 5.0, 5.0, 7.0, 7.0, 7.0, 7.0, 7.0, 3.0, 3.0, 3.0, 3.0, 3.0, 6.0, 6.0, 6.0, 6.0, 6.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 6.0, 6.0, 6.0, 6.0, 6.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 6.0, 6.0, 6.0, 6.0, 6.0, 7.0, 7.0, 7.0, 7.0, 7.0, 5.0, 5.0, 5.0, 5.0, 5.0, 4.0, 4.0, 4.0, 4.0, 4.0, 8.0, 8.0, 8.0, 8.0, 8.0, 2.0, 2.0, 2.0, 2.0, 2.0, 4.0, 4.0, 4.0, 4.0, 4.0, 6.0, 6.0, 6.0, 6.0]
                    
Loading

@sorasoras
Copy link

It's possible to only offload dense part of the model onto GPU

model_arch = gguf.MODEL_ARCH.ARCTIC

def set_vocab(self):
self._set_vocab_llama_hf()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: #6877 (comment), this should be:

Suggested change
self._set_vocab_llama_hf()
try:
self. _set_vocab_sentencepiece()
except FileNotFoundError:
self._set_vocab_llama_hf()

The assertion exists because LlamaHfVocab was primarily written to convert HF "fast" tokenizers with a tokenizer.json. Since before it existed, "slow" sentencepiece tokenizers with a tokenizer.model have (almost?) always been converted using SentencePieceProcessor, which doesn't depend on HF transformers and directly preserves the token types and scores.

If you want to start converting slow tokenizers using HfVocab as well, I won't stop you, but in order to be consistent you'd have to remove all references to SentencePieceProcessor in the convert scripts, and make HF transformers a hard requirement for converting models with a Llama vocab. Otherwise, we'd be making an exception for this model for no clear reason.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My reason is that the official tokenizer.model file for snowflake-arctic-instruct contains wrong BOS and EOS tokens as confirmed in: https://huggingface.co/Snowflake/snowflake-arctic-instruct/discussions/12
That's why I used llama_hf vocab that reads tokens from json files instead. If there is a better solution for this I'm fully open to any suggestions.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cebtenzzre What if I implement ArcticModel::set_vocab() myself like XverseForCausalLM did, is that acceptable?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cebtenzzre I now load vocabulary with SentencePieceProcessor as you suggested and apply necessary token modifications based on added_tokens_decoder field from tokenizer_config.json.

@mofosyne mofosyne added enhancement New feature or request Review Complexity : Medium Generally require more time to grok but manageable by beginner to medium expertise level labels May 9, 2024
Comment on lines 454 to 455
if arch in self.arch_block_mappings_cfg:
block_mappings = self.arch_block_mappings_cfg[arch]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means architecture-specific block mappings can't partially override the common mappings (they have to totally re-define everything)?

Maybe this is fixable by adding the common mappings first to self.mapping, then the architecture-specific mappings?

So maybe using the union operator for dicts would be appropriate here

if arch in self.arch_block_mappings_cfg:
    block_mappings = self.block_mappings_cfg | self.arch_block_mappings_cfg[arch]

But that's only supported since Python 3.9, and gguf-py targets python = ">=3.8"

In this case using {**x, **y} instead of x | y would be more compatible for older-than-3.9 versions of Python, and would allow making a new dict with the content of x augmented/overridden by y. But the new syntax is clearer in my opinion.

After that, the architecture-specific mapping of MODEL_ARCH.ARCTIC should be simpler (since they won't need to include duplicates of the common mappings).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the idea is to keep only "conflicting" block mappings in architecture-specific mappings and "non-conflicting" mappings in general mappings? I think using dict.update() is a better idea then. Mappings for ARCTIC arch would be shortened to:

    # architecture-specific block mappings
    arch_block_mappings_cfg: dict[MODEL_ARCH, dict[MODEL_TENSOR, tuple[str, ...]]] = {
        MODEL_ARCH.ARCTIC: {
            MODEL_TENSOR.FFN_NORM: (
                "model.layers.{bid}.residual_layernorm",
            ),
            MODEL_TENSOR.FFN_NORM_EXP: (
                "model.layers.{bid}.post_attention_layernorm",
            ),
        },
    }

while in the TensorNameMap init we would only have to add:

        if arch in self.arch_block_mappings_cfg:
            self.block_mappings_cfg.update(self.arch_block_mappings_cfg[arch])

What do you think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the idea is to keep only "conflicting" block mappings in architecture-specific mappings and "non-conflicting" mappings in general mappings?

Yes, exactly.

What do you think?

I think using dict.update() would be good. My proposed approach would have made a copy of the dict, but you're right, updating in-place would work too and would be better, since the original block_mappings_cfg isn't used later on (I think?).

I agree with using dict.update() for this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, done

Copy link
Collaborator

@compilade compilade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not test this (the model is quite big), but the code looks good to me. Nice work @fairydreaming!

SSM_A = auto()
SSM_D = auto()
SSM_OUT = auto()
FFN_NORM_EXP = auto()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the actual numbers associated to the enum values of MODEL_TENSOR don't really matter (their names (from TENSOR_NAMES) are used instead in GGUF), maybe FFN_NORM_EXP could be placed right before FFN_GATE_EXP, a bit like FFN_NORM is right before FFN_GATE, for consistency.

If this is changed, it should also be placed similarly in TENSOR_NAMES and MODEL_TENSORS[MODEL.ARCTIC] in gguf-py/gguf/constants.py as well as in the llm_tensor enum, the LLM_TENSOR_NAMES mapping, and the llama_layer struct (and maybe the LLM_ARCH_ARCTIC case in llm_load_tensors?) in llama.cpp.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the order as requested, but in llama_layer struct the order is different, so I didn't touch it. In llm_load_tensors I think it was already in the requested order.

@fairydreaming
Copy link
Collaborator Author

I noticed that the arctic model doesn't use bias tensors, so I removed usage of bias tensors in the LLM_ARCH_ARCTIC-related code (they were all nulls anyway).

@ggerganov
Copy link
Member

I haven't tested as well, but it seems good so feel free to merge

@github-actions github-actions bot added the python python script changes label May 22, 2024
@fairydreaming
Copy link
Collaborator Author

fairydreaming commented May 23, 2024

I haven't tested as well, but it seems good so feel free to merge

@ggerganov I noticed that Snowflake changed the Arctic model 2 weeks ago. The commit says: "Fixes for GQA support" and num_key_value_heads in config.json changed value from 56 to 8, so I have to redownload the model and check if it still works.

@fairydreaming fairydreaming merged commit fbca2f2 into ggml-org:master May 24, 2024
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request May 24, 2024
Add support for ArcticForCausalLM (ggml-org#7020)
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request May 24, 2024
* common : increase max number of experts to 128

* common : add tensor LLM_TENSOR_FFN_NORM_EXPS for normalization before MoE that runs in parallel to attention + ffn

* gguf-py : add architecture-specific block mappings that override selected general block mappings

* convert-hf : add model conversion support for ArcticForCausalLM

* convert-hf : use added_tokens_decoder from tokenizer_config.json to redefine tokens from SentencePiece model (only for ArcticForCausalLM)

* llama : add inference support for LLM_ARCH_ARCTIC

---------

Co-authored-by: Stanisław Szymczyk <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request python python script changes Review Complexity : Medium Generally require more time to grok but manageable by beginner to medium expertise level

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add support to ArcticForCausalLM

7 participants